library(tidyverse)
library(ggplot2)
library(lubridate)
library(ggrepel)
library(readxl)
library(gganimate)
library(gapminder)
library(gifski)
library(png)
library(directlabels)
As college students, we are well aware of the phenomenon of rising tuition costs. In recent years, the cost of attending college has skyrocketed at a rate quicker than inflation, outpacing most other CPI’s (Consumer Price Indices) and leaving many to wonder how they will afford to finance their education.
cpidata <- read_csv("Data/cpi.csv")
colnames(cpidata) <- c("date","TUITION","ALL_ITEMS","ENERGY","HOUSING","MEDICAL","FOOD_BEV","REC","TRANSPORT","EDUCATION","APPAREL")
cpi <- cpidata %>%
select(-REC,-EDUCATION)%>%
pivot_longer(cols=c("TUITION","ALL_ITEMS","ENERGY","HOUSING","MEDICAL", "FOOD_BEV","TRANSPORT","APPAREL"),names_to='cpi_type',values_to = "cpi_value")
cpi$date <- as.Date(cpi$date,"%m/%d/%Y")
cpigraph <- cpi %>%
ggplot(aes(x = date, y = cpi_value, group = cpi_type, color = cpi_type)) +
geom_line() +
labs(x = "Time", y = "CPI Value",title="Values of Major CPI Groups", subtitle = "Date: {frame_along}",caption="Data source: FRED, Federal Reserve Bank of St. Louis") +
theme_classic() +
theme(legend.position = "none") +
transition_reveal(date)+
geom_dl(aes(label = cpi_type,size=2), method = list(dl.trans(x = x + .2), "last.points"))
animate(cpigraph, width = 700, height = 425, fps = 15, duration = 35, rewind = FALSE, start_pause = 25, end_pause = 250)
The average cost of attending a 4-year university has climbed to over $20,000 for public colleges, and near $50,000 for private ones.
tuition_cost <- read_excel("Data/college_cost_over_time.xls", sheet = "Sheet1")
#head(tuition_cost)
tuition_cost %>%
ggplot(aes(x = Year, y = Amount, color = Length)) +
geom_line() +
facet_wrap(~Type) +
theme_classic() +
labs(title = "Average undergraduate cost of fees, room, and board rates in current USD", x = "", y = "")
Many students end up having to take out federal loans to help cover the up-front costs. This leads us to wonder: how are federal loans distributed, and who has to borrow the most for college? Which groups tend to borrow the most, and do they provide a good return on investment? Our research questions are stated below.
Our two groups of interest that we analyzed across were gender (m/f) and field of study.
On average, who can afford to pay more, and who borrows more to finance their undergraduate studies? (based on EFC)
Are there any differences in borrowing between these groups, for those with similar EFC’s? Is one group borrowing more/less than the other?
What major(s) provide the biggest increase in financial well-being post graduation?
Free Application for Federal Student Aid (FAFSA): The College Board states that, “The Free Application for Federal Student Aid is a form completed by current and prospective college students in the United States to determine their eligibility for student financial aid. It is the form you need to fill out to get any financial aid from the federal government to help pay for college. Each year, over 13 million students who file the FAFSA get more than $120 billion in grants, work-study, and low-interest loans from the U.S. Department of Education.”
Expected Family Contribution (EFC): According to the Federal Student Aid website, “The Expected Family Contribution (EFC) is a number that determines students’ eligibility for certain types of federal student aid. This number is calculated with the EFC formulas, which use the information that students provide on the FAFSA. Financial aid administrators (FAAs) subtract the EFC from students’ cost of attendance to determine their need for…federal student financial assistance offered by the U.S. Department of Education”. Among other things, two main factors that go into an EFC calculation are the student’s income and parent’s income.
uplift <- read_csv("Data/statuschanges.csv")
explorerace <- read_csv("Data/10_efc_borrow_race.csv")
exploregender <- read_csv("Data/10_efc_borrow_gender.csv")
Cum_amt_borrowed_2016 <- read_excel("Data/amt_borrowed.xlsx", sheet = "Sheet1")
efc_and_borr <- readxl::read_xlsx("Data/efc_and_amt_borrowed_over_time.xlsx")
salary_data <- readxl::read_xlsx("Data/11_8_data.xlsx", sheet = "salary")
salary_by_major_2018 <- read_excel("Data/salary_by_major_0818.xlsx")
statusdata <- read.csv("Data/morevarsupd.csv")
Our data files are compiled from the National Center for Education Statistics, which can be found here: NCES DataLab. The two studies we used separately in our work are:
In general, both studies compile various data on undergraduate students like demographics, undergraduate field of study, financial aid received, cumulative amount borrowed, student loan amount, and much more. The B&B study also includes data beyond the undergraduate level and tracks data like future salary post graduation.
We manually pulled and cleaned data sets relating to particular variables of interest, which are loaded in below. This was done using the NCES DataLab. Here you can select a study and analysis type (averages/medians/percents), and then select variables of interest and filters to generate a data table. We then downloaded this data as a .csv or .xlsx and cleaned it, leaving just the raw data for us to analyze. These files can be found and downloaded from our GitHub repository.
Note: When considering loans, aid, and EFC in our research, our sample size only includes undergraduate students who applied for or received federal loans to help finance college in the given years. This is only a subset of the total population that goes to college, and does not include those students who did not fill out the FAFSA. Thus, our findings should only be considered for this smaller subset of students reflected in our project.
We began by first exploring who typically has to borrow the most to pay for college. To do this we looked at the total amount students borrowed for their undergraduate studies, and differentiated by race and gender.
Cum_amt_borrowed_2016 %>%
ggplot(aes(x = fct_reorder(Race, Amount, .desc = T), y = Amount, fill = Gender)) +
geom_bar(stat="identity", position = position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 50, vjust = 1, hjust=1, size = 8.5, color = "black")) +
scale_fill_brewer(palette = "Paired") +
labs(x = "", y = "", fill = "", title = "Average cumulative amount borrowed for undergrad (2016, USD)")
This plot clearly shows that across the board, on average female students had to borrow more than male students. This could be partly due to the fact that female students represent a slight majority in college enrollment, which extends to private enrollment that is generally more expensive. Still, this plot shows a stark difference between the two genders in the amount borrowed for undergrad.
From this data we also observe that those who identify as “Black or African American” have the highest average amount borrowed, followed by those who identify as “White”.
efc_and_borr %>%
filter(`Field of study: undergraduate (10 categories)` != "General studies and other") %>%
filter(`Field of study: undergraduate (10 categories)` != "Undecided") %>%
ggplot(aes(x = fct_reorder(`Field of study: undergraduate (10 categories)`, `Cumulative amount borrowed for undergrad`, .desc = T), y = `Cumulative amount borrowed for undergrad`, fill = factor(Year))) +
geom_bar(stat="identity", position = position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 55, vjust = 1, hjust=1, size = 8, color = "black")) +
scale_fill_brewer(palette = "Blues") +
labs(x = "", y = "", fill = "", title = "Average Amount Borrowed for Undergrad")
efc_and_borr2 <- efc_and_borr %>%
filter(`Field of study: undergraduate (10 categories)` != "General studies and other") %>%
filter(`Field of study: undergraduate (10 categories)` != "Undecided")
males2 <- efc_and_borr2 %>%
filter(Gender == "Male")
females2 <- efc_and_borr2 %>%
filter(Gender == "Female")
males_ave_borr <- mean(males2$`Cumulative amount borrowed for undergrad`)
females_ave_borr <- mean(females2$`Cumulative amount borrowed for undergrad`)
data_hline1 <- data.frame(Gender = unique(efc_and_borr$Gender),
hline = c(males_ave_borr, females_ave_borr))
plot <- efc_and_borr2 %>%
ggplot(aes(x = fct_reorder(`Field of study: undergraduate (10 categories)`, `Cumulative amount borrowed for undergrad`, .desc = T), y = `Cumulative amount borrowed for undergrad`, fill = factor(Year))) +
geom_bar(stat="identity", position = position_dodge(), width = 0.8) +
facet_grid(. ~Gender) +
theme_classic() +
theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust=1, size = 7, color = "black")) +
scale_fill_brewer(palette = "Blues") +
labs(x = "", y = "", fill = "", title = "Average Amount Borrowed for Undergrad")
plot +
geom_hline(data = data_hline1,
aes(yintercept = hline),
color = "orange")
#head(efc_and_borr)
efc_and_borr <- efc_and_borr %>%
mutate(ratio = `Cumulative amount borrowed for undergrad`/`Expected Family Contribution`)
efc_and_borr %>%
ggplot(aes(x = fct_reorder(`Field of study: undergraduate (10 categories)`, `Expected Family Contribution`, .desc = T), y = `Expected Family Contribution`, fill = factor(Year))) +
geom_bar(stat="identity", position = position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 55, vjust = 1, hjust=1, size = 8, color = "black")) +
scale_fill_brewer(palette = "Purples") +
labs(x = "", y = "", fill = "", title = "Average Expected Family Contribution")
males <- efc_and_borr %>%
filter(Gender == "Male")
females <- efc_and_borr %>%
filter(Gender == "Female")
males_ave_efc <- mean(males$`Expected Family Contribution`)
females_ave_efc <- mean(females$`Expected Family Contribution`)
data_hline <- data.frame(Gender = unique(efc_and_borr$Gender),
hline = c(males_ave_efc, females_ave_efc))
plot <- efc_and_borr %>%
ggplot(aes(x = fct_reorder(`Field of study: undergraduate (10 categories)`, `Expected Family Contribution`, .desc = T), y = `Expected Family Contribution`, fill = factor(Year))) +
geom_bar(stat="identity", position = position_dodge(), width = 0.8) +
facet_grid(. ~Gender) +
theme_classic() +
theme(axis.text.x = element_text(angle = 70, vjust = 1, hjust=1, size = 7, color = "black")) +
scale_fill_brewer(palette = "Purples") +
labs(x = "", y = "", fill = "", title = "Average Expected Family Contribution")
plot +
geom_hline(data = data_hline,
aes(yintercept = hline),
color = "orange")
On average we see that:
efc_and_borr %>%
ggplot(aes(x = Year, y = ratio, color = Gender)) +
geom_line() +
geom_point() +
theme_classic() +
scale_x_discrete(limits=2008:2016, labels=c(2008,"","","",2012,"","","",2016)) +
facet_wrap(nrow = 4, ncol = 3, facets = vars(`Field of study: undergraduate (10 categories)`)) +
labs(x = "", y = "Amount borrwed / EFC", color = "")
By 2016, in every major males have an equal or lower ratio of amount borrowed for undergrad compared to their EFC. This means that they are borrowing less money for college in relation to the amount their family was expected to pay for. In essence, females on average are having to borrow more even when accounting for EFC. A ratio of 1 means one borrowed the same amount of money as their expected family contribution.
prerank <- statusdata[order(statusdata$`Total.income..Parents.and.independent`,decreasing=TRUE),]
prerank$index <- 1:nrow(prerank)
prerank$time <- 2006
salaryrank2012 <- statusdata[order(statusdata$`Annualized.total.salary.for.all.jobs.in.2012`,decreasing=TRUE),]
salaryrank2012$index <- 1:nrow(salaryrank2012)
salaryrank2012$time <- 2012
salaryrank2017 <- statusdata[order(statusdata$`Gross.income.in.2017`,decreasing=TRUE),]
salaryrank2017$index <- 1:nrow(salaryrank2017)
salaryrank2017$time <- 2017
efc_sal <- rbind(prerank,salaryrank2012,salaryrank2017)
ggplot(data = efc_sal, aes(x = time, y = index, group = field_of_study)) +
geom_line(aes(color = field_of_study, alpha = 1), size = 2) +
geom_point(aes(color = field_of_study, alpha = 1), size = 4) +
scale_y_reverse(breaks = 1:nrow(efc_sal))+
theme_classic() +
labs(x = "Time", y = "Rank") +
theme(legend.position = "none") +
scale_x_discrete(limits=2006:2017, labels=c(2006,"","","","","",2012,"","","","",2017)) +
geom_text_repel(aes(label=field_of_study),
size=2.25,
box.padding = 0.5,
segment.size = 0.25,
color = "black")
loansdata <- statusdata %>%
select(`Cumulative.amount.borrowed.for.education.as.of.2012`,`Amount.owed.in.2009`,`Cumulative.amount.borrowed.in.federal.and.private.student.loans`,`Cumulative.loan.amount.borrowed.for.undergraduate.through.2007.08`,field_of_study)
rank2008 <- loansdata[order(loansdata$`Cumulative.loan.amount.borrowed.for.undergraduate.through.2007.08`,decreasing=TRUE),]
rank2008$amountborrowed <-loansdata$`Cumulative.loan.amount.borrowed.for.undergraduate.through.2007.08`
rank2008$index <- 1:nrow(rank2008)
rank2008$time <- 2008
rank2012 <- loansdata[order(loansdata$`Cumulative.amount.borrowed.for.education.as.of.2012`,decreasing=TRUE),]
rank2012$amountborrowed <-loansdata$`Cumulative.amount.borrowed.for.education.as.of.2012`
rank2012$index <- 1:nrow(rank2012)
rank2012$time <- 2012
rank2019 <- loansdata[order(loansdata$`Cumulative.amount.borrowed.in.federal.and.private.student.loans`,decreasing=TRUE),]
rank2019$amountborrowed <-loansdata$`Cumulative.amount.borrowed.in.federal.and.private.student.loans`
rank2019$index <- 1:nrow(rank2019)
rank2019$time <- 2019
loanscumulative <- rbind(rank2008,rank2012,rank2019)
ggplot(data = loanscumulative, aes(x = time, y = amountborrowed, group = field_of_study)) +
geom_line(aes(color = field_of_study, alpha = 1), size = 2) +
geom_point(aes(color = field_of_study, alpha = 1), size = 4) +
# scale_y_reverse(breaks = 1:nrow(loanscumulative))+
theme_classic() +
labs(x = "Time", y = "Cumulative Amount Borrowed") +
theme(legend.position = "none") +
scale_x_discrete(limits=2008:2019, labels=c(2008,"","","",2012,"","","","","","",2019)) +
geom_text_repel(aes(label=field_of_study),
size=2.25,
box.padding = 0.5,
segment.size = 0.25,
color = "black")
We also wanted to explore which majors were most worthwhile to go to college for financially, and provided the best salaries in the future. Below we did both both a short term and long term analysis of salaries by major.
Here we looked at the short-term value of college majors, and the average salary attained four years after graduation. Specifically, we analyzed an individual’s yearly salary in 2012 based on their field of study in 2008. Because this time frame is more recent after college, we included the most specific field of study descriptions for more clarity.
salary_data %>%
ggplot() +
geom_bar(aes(x = fct_reorder(`Field of study: undergraduate (23 categories)`, `Primary job: Annualized salary, 2012`, .desc = T), y = `Primary job: Annualized salary, 2012`), stat = "identity", fill = "skyblue", alpha = 0.7) +
geom_errorbar(aes(x = `Field of study: undergraduate (23 categories)`, ymin=`CI_low`, ymax=`CI_high`), width=0.4, colour="orange", alpha=0.9, size=1) +
theme_classic() +
theme(axis.text.x = element_text(angle = 52, vjust = 1, hjust=1, size = 5.5, color = "black")) +
labs(x = "Undergraduate Field of Study, 2008", y = "Salary, 2012")
We see that the top three majors from 2008 in terms of salary in 2012 are: engineering, computer science, and manufacturing/construction.
Here we looked at the longer-term value of college majors, and the average salary attained ten years after graduation. Specifically, we analyzed an individual’s yearly salary in 2018 based on their field of study in 2008. Because this time frame is a bit longer after college, we grouped the majors into more general fields of study for easier analysis.
#head(salary_by_major_2018)
salary_by_major_2018 %>%
filter(`Field of study: undergraduate (10 categories)` != "Undeclared") %>%
ggplot() +
geom_bar(aes(x = fct_reorder(`Field of study: undergraduate (10 categories)`, `Current job, as of B&B:08/18 interview: Annualized salary`, .desc = T), y = `Current job, as of B&B:08/18 interview: Annualized salary`), stat="identity", fill = "#69b3a2", alpha = 0.7) +
geom_errorbar(aes(x = `Field of study: undergraduate (10 categories)`, ymin=`CI_low`, ymax=`CI_high`), width=0.4, colour="orange", alpha=0.9, size=1) +
theme_classic() +
theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1, size = 7, color = "black")) +
labs(x = "Undergraduate Field of Study, 2008", y = "Current Salary in 2018", title = "")
We see that the top two majors from 2008 in terms of salary in 2018 are engineering and computer science, followed by business and physical sciences in third and fourth respectively.